Programming Hive by Edward Capriolo

Author: Edward Capriolo, Dean Wampler, and Jason Rutherglen
Language: eng
Format: epub
Tags: COMPUTERS / Programming Languages / Java
ISBN: 9781449326975
Publisher: O'Reilly Media
Published: 2012-09-18T16:00:00+00:00


Accessing the Distributed Cache from a UDF

UDFs may access files inside the distributed cache, the local filesystem, or even the distributed filesystem. Use this access cautiously, as the overhead is significant.

A common use of Hive is analyzing web logs. A popular operation is determining the geolocation of web traffic based on the IP address. MaxMind makes a GeoIP database available, along with a Java API to search it. By wrapping this API in a UDF, you can look up location information for an IP address from within a Hive query.

The GeoIP API uses a small data file. This is ideal for showing the functionality of accessing a distributed cache file from a UDF. The complete code for this example is found at https://github.com/edwardcapriolo/hive-geoip/.

ADD FILE is used to cache the necessary data files with Hive. ADD JAR is used to add the required Java JAR files to the cache and the classpath. Finally, the temporary function must be defined as the final step before performing queries:

hive> ADD FILE GeoIP.dat;
hive> ADD JAR geo-ip-java.jar;
hive> ADD JAR hive-udf-geo-ip-jtg.jar;
hive> CREATE TEMPORARY FUNCTION geoip
    > AS 'com.jointhegrid.hive.udf.GenericUDFGeoIP';
hive> SELECT ip, geoip(source_ip, 'COUNTRY_NAME', './GeoIP.dat') FROM weblogs;
209.191.139.200    United States
10.10.0.1          Unknown

The two rows returned show a public IP address located in the United States and a private IP address, which has no geographic location and is therefore reported as Unknown.

The geoip() function takes three arguments: the IP address in either string or long format, a string that must match one of the constants COUNTRY_NAME or DMA_CODE, and a final argument that is the name of the data file that has already been placed in the distributed cache.
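To make the "string or long format" distinction concrete, here is a minimal sketch of converting between the two IPv4 representations. The class Ipv4Formats and its methods are hypothetical helpers written for illustration; they are not part of the hive-geoip project or the MaxMind API.

```java
// Hypothetical helper (not from hive-geoip): converts between the
// dotted-quad string form and the numeric form of an IPv4 address.
final class Ipv4Formats {
    private Ipv4Formats() {}

    /** Converts a dotted-quad string such as "10.10.0.1" to its numeric value. */
    static long toLong(String dottedQuad) {
        String[] parts = dottedQuad.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected four octets: " + dottedQuad);
        }
        long value = 0;
        for (String part : parts) {
            // parseInt throws NumberFormatException for empty or non-numeric octets.
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            // Shift the accumulated value one byte left and append this octet.
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Converts a numeric IPv4 value back to dotted-quad form. */
    static String toDottedQuad(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("Not a 32-bit address: " + value);
        }
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    public static void main(String[] args) {
        long numeric = toLong("10.10.0.1");
        System.out.println(numeric + " <-> " + toDottedQuad(numeric));
    }
}
```

A long is used rather than an int because Java has no unsigned 32-bit type, and addresses above 127.255.255.255 would otherwise become negative.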

The first call to the UDF (which triggers the first call to the evaluate() method in the implementation) will instantiate a LookupService object that uses the file located in the distributed cache. The lookup service is saved in a field, so it only needs to be initialized once in the lifetime of the map or reduce task that initializes it. Note that the LookupService has its own internal caching, enabled with LookupService.GEOIP_MEMORY_CACHE, so that optimization should avoid frequent disk access when looking up IPs.
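The initialize-once behavior described here is ordinary lazy initialization. The sketch below isolates that pattern without any Hive or MaxMind dependency; LazyResource is a hypothetical class invented for illustration, and it assumes the holder is only ever used from a single thread, so it does no locking.

```java
import java.util.function.Supplier;

// Hypothetical illustration of the lazy-initialization pattern: the
// expensive object is built on first use and reused afterwards.
// Assumption: single-threaded access, so no synchronization is done.
final class LazyResource<T> {
    private final Supplier<T> factory;
    private T instance; // stays null until the first call to get()

    LazyResource(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        if (instance == null) {
            // First call only: pay the construction cost once.
            instance = factory.get();
        }
        return instance;
    }

    public static void main(String[] args) {
        int[] created = {0};
        LazyResource<String> service = new LazyResource<>(() -> {
            created[0]++;
            return "lookup-service"; // stand-in for an expensive object
        });
        for (int row = 0; row < 1000; row++) {
            service.get(); // simulates one call per input row
        }
        System.out.println("constructed " + created[0] + " time(s)");
    }
}
```

The factory runs only on the first get(), which mirrors how the UDF defers the cost of opening the data file until a row is actually processed.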

Here is the source code for evaluate():

@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
  // First argument: the IP address, as either a long or a string.
  if (argumentOIs[0] instanceof LongObjectInspector) {
    this.ipLong = ((LongObjectInspector)argumentOIs[0]).get(arguments[0].get());
  } else {
    this.ipString = ((StringObjectInspector)argumentOIs[0])
      .getPrimitiveJavaObject(arguments[0].get());
  }
  // Second argument: the name of the property to return.
  this.property = ((StringObjectInspector)argumentOIs[1])
    .getPrimitiveJavaObject(arguments[1].get());
  if (this.property != null) {
    this.property = this.property.toUpperCase();
  }
  // Create the lookup service on the first call only.
  if (ls == null) {
    if (argumentOIs.length == 3) {
      // Third argument: the data file placed in the distributed cache.
      this.database = ((StringObjectInspector)argumentOIs[2])
        .getPrimitiveJavaObject(arguments[2].get());
      File f = new File(database);
      if (!f.exists())
        throw new HiveException(database + " does not exist");
      try {
        ls = new LookupService(f, LookupService.GEOIP_MEMORY_CACHE);
      } catch (IOException ex) {
        throw new HiveException(ex);
      }
    }
  }
  ...

An if statement in evaluate determines which data the method should return. In our example, the country name is requested:

  ...
  if (COUNTRY_PROPERTIES.contains(this.property)) {
    Country country = ipString != null ?
      ls.getCountry(ipString) : ls.getCountry(ipLong);
    if (country == null) {
      return null;
    } else if (this.property.equals(COUNTRY_NAME)) {
      return country.getName();
    } else if (this.property.equals(COUNTRY_CODE)) {
      return country.getCode();
    }
    assert(false);
  } else if (LOCATION_PROPERTIES.contains(this.property)) {
  ...
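The excerpt selects the return value with a chain of if statements. As an alternative, not what the example code does, the same mapping from property name to returned field can be sketched with a lookup table. Everything here is hypothetical: CountryInfo stands in for the API's Country class, and CountryPropertyDispatch is invented for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the GeoIP API's Country class.
final class CountryInfo {
    final String name;
    final String code;

    CountryInfo(String name, String code) {
        this.name = name;
        this.code = code;
    }
}

// Hypothetical table-driven alternative to an if/else chain.
final class CountryPropertyDispatch {
    static final String COUNTRY_NAME = "COUNTRY_NAME";
    static final String COUNTRY_CODE = "COUNTRY_CODE";

    // Maps each supported property name to the field it selects.
    private static final Map<String, Function<CountryInfo, String>> EXTRACTORS =
            new HashMap<>();
    static {
        EXTRACTORS.put(COUNTRY_NAME, c -> c.name);
        EXTRACTORS.put(COUNTRY_CODE, c -> c.code);
    }

    /** Returns the requested field, or null if the country or property is unknown. */
    static String extract(CountryInfo country, String property) {
        if (country == null || property == null) {
            return null;
        }
        Function<CountryInfo, String> extractor =
                EXTRACTORS.get(property.toUpperCase(Locale.ROOT));
        return extractor == null ? null : extractor.apply(country);
    }

    public static void main(String[] args) {
        CountryInfo us = new CountryInfo("United States", "US");
        System.out.println(extract(us, "country_name"));
        System.out.println(extract(us, "COUNTRY_CODE"));
    }
}
```

With a table, supporting another property means adding one map entry instead of another branch, at the cost of an extra level of indirection.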


